fix: vLLM 0.17.0 collector compat (DSA, MLA module, MoE)#718
Arsene12358 merged 4 commits into main
Conversation
Create version-specific collector files for vLLM >= 0.17.0, isolating framework version compat from the existing collectors (which continue to serve vLLM < 0.17.0 unchanged).

New files:
- collect_mla_module_v2.py: deterministic no-RNG init to avoid CUDA graph RNG corruption from DSA modules (vllm#39371), auto_map stripping, KV cache scale buffer init
- collect_moe_v2.py: shared VllmConfig + set_forward_context for MoERunner compat (vllm#32344), pcp_size=1, is_gated_activation

Registry changes:
- moe, mla_*_module, dsa_*_module ops now use VersionRoute to route to v2 files on vLLM >= 0.17.0, falling back to the originals otherwise

Signed-off-by: Simone Chen <simonec@nvidia.com>
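The routing mechanism can be sketched roughly as follows (a minimal sketch; the real `VersionRoute` in `collector/vllm/registry.py` may differ in signature and naming):

```python
def _parse(v: str) -> tuple:
    """'0.17.0' -> (0, 17, 0) so versions compare numerically."""
    return tuple(int(p) for p in v.split("."))

class VersionRoute:
    """Pick a collector module based on the detected vLLM version."""

    def __init__(self, default: str, routes: dict):
        # routes maps a minimum-version string to a module name;
        # check the highest threshold first.
        self.default = default
        self.routes = sorted(routes.items(),
                             key=lambda kv: _parse(kv[0]),
                             reverse=True)

    def resolve(self, vllm_version: str) -> str:
        v = _parse(vllm_version)
        for min_version, module in self.routes:
            if v >= _parse(min_version):
                return module
        return self.default

moe_route = VersionRoute("collect_moe_v1", {"0.17.0": "collect_moe_v2"})
print(moe_route.resolve("0.16.2"))  # collect_moe_v1
print(moe_route.resolve("0.19.0"))  # collect_moe_v2
```

Tuple comparison keeps "0.9.0" < "0.17.0" correct, which naive string comparison would get wrong.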
Collected with v2 collector files (0 DSA/MLA module errors):
- dsa_context_module_perf.txt: 9297 lines
- dsa_generation_module_perf.txt: 14905 lines
- mla_context_module_perf.txt: 5425 lines (new)
- mla_generation_module_perf.txt: 5665 lines (new)
- moe_perf.txt: 38152 lines

Signed-off-by: Simone Chen <simonec@nvidia.com>
Sanity Check Chart Generation Report

📥 Download all sanity charts from workflow artifacts

New perf data files were detected in this PR. Please use the link above to download and review the charts. Below is a report of whether chart generation was successful for each op.

Chart Generation Report for system: b200_sxm, backend: vllm, backend_version: 0.17.0
Chart Generation Report for system: b200_sxm, backend: vllm, backend_version: 0.19.0
Walkthrough

The pull request adds two new vLLM benchmarking collector scripts for MLA (multi-head latent attention) and MoE (mixture-of-experts) modules with multi-backend quantization support, updates the registry to enable version-aware module selection, and refreshes performance baseline data via Git LFS.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
🧹 Nitpick comments (3)
collector/vllm/collect_mla_module_v2.py (2)
88-121: Temp directories are not cleaned up.

The temp directories created by `mkdtemp()` are cached in `_local_config_cache` but never removed. For a collector that runs many test cases in a single process, this is likely acceptable (the OS cleans `/tmp` periodically). However, if this concern is raised:

💡 Optional: Register cleanup with atexit
```python
import atexit
import shutil

def _cleanup_temp_dirs():
    for tmp_dir in _local_config_cache.values():
        try:
            shutil.rmtree(tmp_dir, ignore_errors=True)
        except Exception:
            pass

atexit.register(_cleanup_temp_dirs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@collector/vllm/collect_mla_module_v2.py` around lines 88 - 121, _temp directories created by _resolve_model_path via tempfile.mkdtemp are cached in _local_config_cache but never cleaned up; add a cleanup routine and register it with atexit to remove those temp dirs on process exit (use shutil.rmtree with ignore_errors=True) and ensure the routine iterates over _local_config_cache values; implement a helper function (e.g. _cleanup_temp_dirs) and call atexit.register(_cleanup_temp_dirs) so mkdtemp-created dirs are removed when the process ends.
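For context, the cached temp-dir pattern under discussion can be sketched like this (simplified; `_local_config_cache` comes from the review above, but the strip logic and names here are assumptions, not the collector's exact code):

```python
import json
import tempfile
from pathlib import Path

# Cache of source config dir -> temp dir holding the stripped config.
_local_config_cache: dict = {}

def strip_auto_map(src_dir: str) -> str:
    """Copy config.json into a mkdtemp() dir with auto_map removed,
    so AutoConfig.from_pretrained() won't try to import remote code
    (e.g. configuration_deepseek.py) that isn't present locally."""
    if src_dir in _local_config_cache:
        return _local_config_cache[src_dir]
    cfg = json.loads(Path(src_dir, "config.json").read_text())
    cfg.pop("auto_map", None)  # vLLM supports the architecture natively
    tmp_dir = tempfile.mkdtemp(prefix="stripped_cfg_")
    Path(tmp_dir, "config.json").write_text(json.dumps(cfg))
    _local_config_cache[src_dir] = tmp_dir
    return tmp_dir
```

A cleanup hook like the atexit suggestion above would simply iterate `_local_config_cache.values()` and `shutil.rmtree` each entry.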
415-438: Minor comment/code mismatch on scale initialization.

The comment on line 417 says "Scale params → 1.0" but line 435 uses `fill_(0.5)`. The 0.5 value works fine (it avoids NaN during processing), but the comment is slightly misleading.

📝 Suggested fix

```diff
  # Initialize with random weights.
  # FP8 weights → zero (safe dummy value).
- # Scale params → 1.0 (avoid NaN during process_weights_after_loading).
+ # Scale params → 0.5 (avoid NaN during process_weights_after_loading).
  # Everything else → small constant.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@collector/vllm/collect_mla_module_v2.py` around lines 415 - 438, The comment and code disagree about the initial value for "scale" params: the comment says "Scale params → 1.0" but the loop in attn_module initialization sets scale tensors with tensor.data.fill_(0.5); update the comment to state "Scale params → 0.5" (or change the fill_ call to 1.0 if you prefer that behavior) so the documentation matches the implementation; locate the loop that iterates over attn_module.named_parameters()/named_buffers() and the branch that checks tensor.dtype == torch.float32 and "scale" in name to make the change.

collector/vllm/collect_moe_v2.py (1)
248-251: Consider using deterministic initialization for bias tensors.

The PR objectives note that vLLM 0.17.0 has CUDA graph RNG offset tracking issues. While `collect_mla_module_v2.py` uses `fill_()` to avoid RNG calls, this code uses `normal_()` for bias initialization. If the MXFP4 path is used after DSA module collection in the same process, this could trigger RNG offset errors.

Since bias values don't affect kernel latency, consider using `fill_()` for consistency:

🛡️ Suggested safer initialization

```diff
  if hasattr(moe_module, "w13_bias"):
-     moe_module.w13_bias.data.normal_()
+     moe_module.w13_bias.data.fill_(0.01)
  if hasattr(moe_module, "w2_bias"):
-     moe_module.w2_bias.data.normal_()
+     moe_module.w2_bias.data.fill_(0.01)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@collector/vllm/collect_moe_v2.py` around lines 248 - 251, The bias initialization in collect_moe_v2.py uses nondeterministic normal_() on moe_module.w13_bias and moe_module.w2_bias which can break CUDA graph RNG offset tracking; change these to deterministic in-place fills (e.g., use .data.fill_(0) or another fixed constant) inside the same attribute checks for moe_module.w13_bias and moe_module.w2_bias so no RNG is invoked during module collection.
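The failure mode behind both comments can be illustrated with a torch-free toy (`RNGGuard` is a hypothetical stand-in for CUDA's RNG offset tracking, not a vLLM or PyTorch API):

```python
import random

class RNGGuard:
    """Toy model of CUDA graph RNG offset tracking: once a graph has
    been captured, any further RNG use raises, mirroring the
    'Offset increment outside graph capture' crash."""

    def __init__(self):
        self.captured = False

    def draw(self) -> float:
        if self.captured:
            raise RuntimeError("Offset increment outside graph capture")
        return random.random()

def init_random(guard: RNGGuard, n: int) -> list:
    # RNG-based init (analogous to normal_()/uniform_()): unsafe
    # once the guard is in the captured state.
    return [guard.draw() for _ in range(n)]

def init_filled(n: int, value: float = 0.01) -> list:
    # Deterministic fill (analogous to fill_()): touches no RNG state.
    return [value] * n

guard = RNGGuard()
guard.captured = True      # simulate state after DSA module construction
bias = init_filled(4)      # safe: [0.01, 0.01, 0.01, 0.01]
```

Since benchmark latency does not depend on the tensor contents, the deterministic path loses nothing.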
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 46e56a3a-7d6a-4c36-b281-2960f156d8ac
📒 Files selected for processing (8)
- collector/vllm/collect_mla_module_v2.py
- collector/vllm/collect_moe_v2.py
- collector/vllm/registry.py
- src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_context_module_perf.txt
- src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_generation_module_perf.txt
- src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_context_module_perf.txt
- src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_generation_module_perf.txt
- src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/moe_perf.txt
Combined gemm and moe performance data from two pipeline runs. Attention/MLA/DSA collection had errors and is not included. Signed-off-by: Simone Chen <simonec@nvidia.com>
Rename collect_moe.py -> collect_moe_v1.py and collect_mla_module.py -> collect_mla_module_v1.py to satisfy the test_versioned_modules_use_vn_suffix registry integrity check. Signed-off-by: Simone Chen <simonec@nvidia.com>
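The naming rule that check enforces can be approximated with a small regex (the actual test lives in the repo's registry integrity tests; this sketch only assumes the `_vN` suffix convention described above):

```python
import re

_VN_SUFFIX = re.compile(r"_v\d+\.py$")

def uses_vn_suffix(filename: str) -> bool:
    """True if a versioned collector module ends in _v<N>.py."""
    return bool(_VN_SUFFIX.search(filename))

print(uses_vn_suffix("collect_moe_v1.py"))         # True
print(uses_vn_suffix("collect_mla_module_v2.py"))  # True
print(uses_vn_suffix("collect_moe.py"))            # False (old name)
```

With every versioned module named this way, the registry can pair a `_v1`/`_v2` file with each `VersionRoute` entry mechanically.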
Overview:
Fix vLLM 0.17.0 collector compatibility for DSA module, MLA module, and MoE MXFP4 benchmarks on B200. Uses version-routed v2 collector files to isolate 0.17.0 changes from existing collectors.
Details:
DSA module collector (collect_mla_module_v2.py):

- DSA modules corrupt the CUDA graph RNG state during DeepseekV2MLAAttention construction (vllm#39371). Any subsequent RNG operation crashes with "Offset increment outside graph capture".
- enforce_eager and manual_seed() do not clear the state — the corruption originates inside module construction.
- Replaced all RNG-based initialization (normal_, uniform_, randn, randint) with deterministic fill_()/torch.full(); exact weight values do not need to be realistic since they pass through process_weights_after_loading() anyway.
- vLLM 0.17.0 registers k_scale/v_scale as buffers, not parameters. The init loop missed them, leaving sentinel values that fail process_weights_after_loading() (the k_scale > 0.0 assertion).
- The model's config.json has auto_map pointing to configuration_deepseek.py. HuggingFace's AutoConfig.from_pretrained() (called by vLLM's ModelConfig) unconditionally tries to import it from the temp directory, where it doesn't exist. Strip it; vLLM natively supports the architecture.

MoE MXFP4 collector (collect_moe_v2.py):

- vLLM 0.17.0 dispatches FusedMoE.forward() through get_forward_context() → get_layer_from_name(), requiring the module to be registered in static_forward_context. Share the same VllmConfig between FusedMoE.__init__ and the benchmark's set_forward_context() so the registration is visible.
- The FusedMoE constructor changed (vllm#32344). Pass pcp_size=1 to avoid get_pcp_group(), which requires distributed init.
- Pass is_gated_activation=True to prepare_static_weights_for_trtllm_fp4_moe() (GPT-OSS uses SwiGLU).

Version routing (registry.py):

- moe, mla_*_module, and dsa_*_module ops use VersionRoute to route to v2 files on vLLM >= 0.17.0, falling back to the originals otherwise.

Data — clean collection from job 295500035 (0 DSA/MLA module errors). Adds the previously missing mla_context_module_perf.txt and mla_generation_module_perf.txt.

Known limitations:

- Some MoE MXFP4 cases still hit weight_scale_vec_size errors — the FlashInfer TRTLLM FP4 kernel rejects the weight format; likely needs a FlashInfer-side fix.
- Cases with tp_size > 1 fail at FusedMoE.__init__ — they require distributed init, which is not available in the standalone collector.
- The MLA attention collector (collect_mla.py) fix is deferred — vLLM 0.17.0 changed the FlashInferMLAImpl forward API.
weight_scale_vec_sizeerrors — FlashInfer TRTLLM FP4 kernel rejects the weight format; likely needs FlashInfer-side fixtp_size > 1fail atFusedMoE.__init__— requires distributed init not available in standalone collectorcollect_mla.py) fix deferred — vLLM 0.17.0 changed theFlashInferMLAImplforward APIWhere should the reviewer start?
collector/vllm/registry.py → collector/vllm/collect_mla_module_v2.py

Summary by CodeRabbit
New Features
Improvements